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This paper studies a very flexible model that can be used widely 
to analyze the relation between a response and multiple covariates. 
The model is nonparametric, yet renders easy interpretation for the 
effects of the covariates. The model accommodates both continuous 
£-H . and discrete random variables for the response and covariates. It is 

■ quite flexible to cover the generalized varying coefficient models and 

the generalized additive models as special cases. Under a weak con- 
dition we give a general theorem that the problem of estimating the 
multivariate mean function is equivalent to that of estimating its uni- 
variate component functions. We discuss implications of the theorem 
for sieve and penalized least squares estimators, and then investigate 
the outcomes in full details for a kernel-type estimator. The kernel 
estimator is given as a solution of a system of nonlinear integral 
equations. We provide an iterative algorithm to solve the system of 
equations and discuss the theoretical properties of the estimator and 

■ the algorithm. Finally, we give simulation results. 

■ 1. Introduction. The varying coefficient regression model, proposed by 
£Nj . Hastie and Tibshirani (1993), and studied by Yang et al. (2006), Roca- 

Pardinas and Sperlich (2010) and Lee, Mammen and Park (2012), is known 
to be a useful tool for analyzing the relation between a response and a mul- 
tivariate covariate. For a response Y and covariates X and Z, they assumed 

5^1 1 that the mean regression function, m(x, z) = ^(YjX = x, Z = z), takes the 
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form m(x, z) = x\fi(zi) H + x d fd(zd) for some unknown univariate func- 
tions fj. The model is simple in structure, gives easy interpretation, and 
yet is flexible since the dependence of the response variable on the covari- 
ates is modeled in a nonparametric way. The major hurdle in the practical 
application of this model is that one needs to pair up each "X-covariate" 
with only one "Z-covariate." Typically this is not the case in practice. In 
principle, each X-covariate may interact with any number of Z-covariates 
to explain the variation in Y. Moreover, it is often difficult to differentiate 
X-types from Z-types in a group of the covariates. 

In this paper we are concerned with quite a more flexible setting than 
the usual varying coefficient model. Suppose that we are given a group of 
covariates, Xi, . . . ,Xr>. Let X = {X\, . . . , Xe>) t and d be an integer such 
that d < D. The model we are interested in assumes that there is a link 
function, say g, such that the mean function m(x) = E(Y\X. = x) satisfies 

(1.1) s(m(x)) =xi{^2 hk(xk)) H \-Xdi ^2 fdk{x k )j, 

where the index sets Ij C {1,2, ... ,D} are known, and each Ij does not 
include j. The covariates that enter into one of the coefficient functions fjk, 
that is, Xk for k £ C = Uj=i Ij are °f continuous type. For simplicity, we 
assume that Xk with k G C are supported on the interval [0, 1] . We allow 
some of the covariates Xj, for 1 < j < d, to be discrete random variables. 
We also allow that C and {1,2,..., cJ} may have common indices. Let Co = 
C D {1, 2, . . . ,d}. The case Co = 0, that is, C = {d + 1, . . . , D}, corresponds to 
the situation where one can distinguish between two groups of covariates, 
{Xi,...,X d } and {Z ± , . . . , Z p } = {X d+1 , . . . , X d+p } with D = d + p. In this 
case, the model reduces to 

(1.2) g(m(x,z)) =xt( ^ /lfcfoOj H ^ x di/~2 fdk(zk)), 

where Ij C {1,2, ... ,p} are index sets of Z^. The latter model arises, for 
example, when one takes all Xj, for 1 < j < d, to be discrete covariates. 
The above model reduces further to the nonparametric generalized additive 
model of Yu, Park and Mammen (2008) if we take d = 1 and X\ = 1. 

The functions fjk in the representation (1.1) are not identifiable. To see 
this, consider the case where d = D = 3, I\ = {2, 3}, I2 = {3}, ^3 = {2} so that 
C = C Q = {2,3}. In this case, x\ [fiiix-}) + /i 3 (x 3 )] + x 2 /23(^3) + £3/32(2:2) = 
xi [312(^2) +513(^3)] +X2923(x3) +£3332(22), if 312 (^2) = fi2(x 2 ) + c, g 13 (x 3 ) = 
fi3(x 2 ) - c, 523 (£3) = /23(£3) + x 3 and #32(2:2) = jfo^) - x 2 for some con- 
stant c. To make all fjk identifiable, we use the following constraints: 

fjk(xk)wk(xk)dx k = 0, keC,l<j <d, 
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(1.3) 

J x k fjk(x k )wk(x k )dx k = 0, j,keC 

for nonnegative weight functions w k , where Co = {1,2, . . . , d}C\C. With these 
constraints we may rewrite model (1.1) as 

d d / \ 

(1.4) 5 (m(x))=^ 

a j X j / j a jkXjX k -+ 
j=l j<k j=l kelj ' 

j,kec a 

We think that our approach broadens the field of applications of varying 
coefficient models essentially. The link function allows us to have a discrete 
response. Furthermore, our model frees us from the restrictive settings of the 
usual varying coefficient model that one should differentiate between two 
types of covariates, X- and Z-type as in (1.2), and that each covariate ap- 
pears in only one "nonlinear interaction term." In the case where X±, . . . , X q , 
for some q < D — 2, are discrete and the remaining covariates are continu- 
ous random variables, our approach allows fitting the full model, where 
Ij = {q + 1,...,D} for all l<j<q and I j = {q + l,... ,j - 1, j + 1,...,D} 
for q + 1 < j < D: 

g(m(x))=x 1 l ^2 hk(xk) J H \-x q i ^ f P k{x k ) J 

\fc=g+l / \fe=g+l / 

(1-5) 

+ Xg+i( ^ fq+l,k{x k ) J H hX D l ^ fDk{x k )\. 

\k=q+2 J \k=q+l J 

One may fit the full model in an exploratory analysis of the data, and find 
a parsimonious model, deleting some of the functions fj k in the full model, 
that fits the data well. 

We stress that fitting model (1.1) is not more complex than fitting other 
varying coefficient models such as g(m(x)) = xi/i(xd+i) + • • • + Xdfd(x2d)- 
The complexity is only in notation and theory, yet it gives full flexibility in 
modeling via varying coefficients. Think of the case where the true model 
is g(m(x)) = X1/12OE2) +^2/21(^1)- I n this case, p(m(x)) may not be well 
approximated by either £1/12(^2) or £2/21(^1) alone. Each additive term in 
g(m(x)) is interpreted as a (linear) x (nonlinear) interaction. With X2 being 
held fixed, modeling by £1/12(^2) alone, for example, reflects only the linear 
effect of X\, while modeling by xi/i2(^2) +£2/21(^1) accommodates the 
nonlinear effect of X\ as well. 

Xue and Yang (2006) discussed a special case of model (1.2) where one can 
differentiate between X-type and Z-type variables and all Ij = {1, 2, . . . ,p}. 
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They proved that the functions fjk with the constraints (1.3) are identifiable. 
The essential assumption was that the smallest eigenvalue of i?(XX T |Z = z) 
is bounded away from zero where X = (X\ , . . . , Xd) T and Z = (Z\ , . . . , Z p ) T , 
although they put a stronger one; see their condition (C2). Their approach 
cannot be extended to our model (1.1). To see this, consider the model (1.5) 
and think of £(XX T |X C = x c ) where X = (X u . . . , X D ) T and X c = (X q+1 , 
Xd) t . Certainly, the matrix is singular for all x c if D — q > 2. One of our 
main tasks in this paper is to relax the requirement that the smallest eigen- 
value of £(XX T |X C = x c ) is bounded away from zero; see assumption (AO) 
in Section 2. This weaker condition is typically satisfied. In Lemma 1 in the 
Appendix, we show that, under the weaker condition, the L2-norms of m and 
of the function tuple (/,& :k G J, , 1 < j < d) in the model (1.1) are equiva- 
lent, modulo the norming constraints (1.3). This has important implications 
in estimating the model (1.1). First, it implies that the functions fjk are 
identifiable. For some types of estimators of m, it also gives directly the 
first-order properties of the corresponding estimators of fjk- In the next sec- 
tion, we discuss the implications. In the subsequent sections, we focus on the 
smooth backfitting method of estimating the model (1.1), where we also use 
the main idea contained in Lemma 1 to derive its asymptotic properties. The 
original idea of smooth backfitting was introduced by Mammen, Linton and 
Nielsen (1999) for fitting additive models, and it has been further developed 
in other contexts; see Yu, Park and Mammen (2008) and Lee, Mammen and 
Park (2010, 2012), for example. 

Earlier works on varying coefficient models focused on the model m(x, z) = 
xi/i(z) + • • • + Xdfd( z ), where a single covariate Z (univariate or multivari- 
ate) enters into all coefficient functions. This model was proposed and stud- 
ied by Chen and Tsay (1993), Kauermann and Tutz (1999), Fan and Zhang 
(1999, 2000), Cai, Fan and Li (2000), Cai, Fan and Yao (2000) and Fan, 
Yao and Cai (2003). Mammen and Nielsen (2003) added a link function to 
this model: g(m(x, z)) = x±fi(z) H \-Xdfd(z)- The case where fj are time- 
varying was treated by Hoover et al. (1998), Huang, Wu and Zhou (2002, 
2004), Wang, Li and Huang (2008), and Noh and Park (2010). Heim et al. 
(2007) also considered the case where all coefficient functions are defined on 
a single 3D spatial space. Fitting these models is simple. A univariate or 
multivariate smoothing across the single variable Z, or on a time scale, or 
on a multidimensional spatial space, gives directly estimators of fj without 
further projection (by marginal integration or backfitting, e.g.) onto relevant 
function spaces. However, this suffers from the curse of dimensionality when 
the dimension of Z is high. For this reason, most works were focused on 
univariate Z. Some time series models related to the functional coefficient 
model, with Xj being unobserved common factors that depend on time, have 
been proposed and studied by, for example, Fengler, Hardle and Mammen 
(2007) and Park et al. (2009). 
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2. Equivalence in entropies of function classes. In this section we will 
show that the nonparametric components, fj k : k € Ij, 1 < j < d, of our mod- 
el (1.4) with constraints (1.3) can be estimated with a one-dimensional non- 
parametric rate. This means that our model avoids the curse of dimension- 
ality. It is easy to check that the function m(x) can be estimated with a one- 
dimensional rate. This follows by application of results from empirical pro- 
cess theory; see below. We will use that the L 2 -norms of m and of the tuples 
(a; fjk : 1 < j < d,k G Ij) are equivalent; see Lemma 1 in the proofs section. 
Here, a denotes the vector with elements ay : 1 < j < d, aj k :j < k,j, A; € Co- 
Our next result uses this fact to show that the rate for the estimation of m 
carries over to the estimation of (a; fj k : 1 < j < d, k 6 Ij). 

In the description of our method and in our theory we will also make use 
of a different representation of the model (1.4). In this representation of the 
model, we collect those coefficients that are functions of the same continuous 
covariate and put them together as an additive component. Suppose that, 
among X±, . . . ,Xj in model (1.4), there are r (0 < r < d) variables whose 
indices do not enter into C. Without loss of generality, we denote them by 
X\, . . . ,X r . Let p = D — r > 2 be the number of indices in C. Thus C = 
{r + 1, . . . ,r +p} and Co = {r + 1, . . . ,d}. Define 

(2.1) it k = {Xj:r + k£ Ij,l<j <d}, l<k<p. 

The vector X/% is the collection of all Xj, for 1 < j < d, that have interactions 
with X r+ k in the form of Xjfj^ r+ k{X r+ h). Thus, does not include X r+k as 
its element. Let dk denote the number of the index sets Ij that contain r + k. 
Thus, Xfc is of cifc-dimension. Likewise, for a given vector x, we denote the 
above rearrangements of x by x^., 1 < k < p. Also, define ffc = {fj^ r+ k :r + k£ 
Iji 1 < J ' < d} for 1 < k < p. Then model (1.4) can be represented as 

d 

(2.2) ff(m(x)) = y^ j a j x j + ^ a jk XjX k + 5i[fi(x r+1 ) H \-xJf p (x r+p ). 

j=l j<k 

j,kec 

To give an example of the above representation, consider the case where 
d = D = 3, h = {2, 3}, h = {3}, h = {2} so that C = C = {2, 3}. In this 
case, r = 1, xi = (x 1 ,x 3 ) T , x 2 = (xi,x 2 ) T , fi = (/i2,/32) T , h = (/i3,/23) T , 
and thus 

xi[fi2(x 2 ) + f 13 (x 3 )) +^2/23(2:3) + ^3/32(^2) = x7fi(x 2 ) + x 2 r f 2 (x 3 ). 
Suppose now that one has an estimator m of m with 

d d , 

(2.3) g(m(x))=Y^ a j X 3 / j QjkXjXk + 

j=i j<k j=i \eij ' 

j,kec 
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where the estimated functions fjk satisfy the constraints in (1.3). We make 
the following assumption: 

(AO) It holds that the product measure Yif=i has a density w.r.t. the 
distribution Px of X that is bounded away from zero and infinity on the 
support of Px- Here, Px is the marginal distribution of Xj. The marginal 



distributions are absolutely continuous w.r.t. Lebesgue measure or they are 
discrete measures with a finite support. Furthermore, the weight functions 
Wj for j G C in the constraints in (1.3) are chosen so that Wj/pXj is bounded 
away from zero and infinity on the support of Pxj ■ Here, pxj is the density 
of Xj. The smallest eigenvalues of the matrices P[XfcXj|A r _|_fc = Zk] for 
1 <k <p are bounded away from zero for Zk in the support of Px r+k ■ 

The condition on E[~Kk~X.J,\X r+ k = Zf,] in (AO) is typically satisfied. For 
example, consider the model (1.5), where X\, . . . ,X q are discrete random 
variables whose indices do not enter into C. Thus, r = q and p = D — q. Ac- 
cording to configuration (2.1), we get X fc = (Xj : 1 < j < D, j ^ q + k) T which 
is the covariate vector X without X q+ k- In this case, PfXfcXjl A r+ fc = Zk] is 
positive definite if the support of the conditional distribution Px fe |x k=Zk 
contains (D — 1) linearly independent vectors. 

We get the following theorem for the rate of convergence of the compo- 
nents of m. 

Theorem 1. Suppose that assumption (AO) applies, and that an esti- 
mator to of to with (2.3) and (1.3) satisfies that, for a null sequence K n , 

(2.4) j [ 5 (to(x)) - <7(TO(x))] 2 P x (dx) = O p (n 2 n ). 

Then it holds that 

'jjk(xk) - fjk(xk)fpx k (xk) dx k = Op(nl) 
for all k € Ij,l < j < d. 

It is easy to construct estimators that fulfill (2.4). We will discuss this for 
the case where i.i.d. observations (X l ,y J ) are made on the random vector 
(X, Y) with X* = (X\, . . . , X l D ) T of P^-dimension. Examples for estimators 
that fulfill (2.4) are sieve estimators or penalized least squares estimators. If 
one makes entropy conditions on some function classes J~jk, then Theorem 1 
can be used to show that the entropy conditions carry over to the class M. = 
{to:<7(to(x)) has the structure (1.4) for some a £ A and fjk £ JFj^k £ Ij, 
1 < j < d} for some compact set A. Using empirical process methods one 
can then show that sieve estimators or penalized least squares estimators 
fulfill (2.4). Below we outline this for the case where J-jk are the classes of 
/-times differentiable functions for some I > 2. 
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The penalized least squares estimator (a PEN ; /™ :l<j<d,k€lj) min- 



Jjk 

imizes 



n 



-1 E ^-f 1 E 



i=l K \j=l j<k 

(2.5) 



2 



j=i keij J ) 

where we put J(f) = £ fce/l / ^i/lfcO*) 2 dz + - ■ - + J2 keId f D l J dk (z) 2 dz. Here, 
the functions /™ N are chosen so that (1.3) holds, and for a function g, D l z g 
denotes its Zth order derivative. We get the following result for the rate of 
convergence of the penalized least squares estimators ff k EN - 

Corollary 1. Suppose that all assumptions of Theorem 1 hold, that 
the link function g has an absolutely bounded derivative, and that the esti- 
mators fj™ of fjk are defined by (2.5). Suppose that, for k £ Ij,l < j < d, 
the functions fjk have square integrable derivatives of order I. Furthermore, 
assume that the (conditional) distribution of £i = Yi — m(X t ), 1 < i < n, has 
subexponential tails. That is, there are constants t e ,c e > such that 

sup ^[exp^DlX 1 ,...^"] <c £ 

l<i<n 

almost surely for \t\ < t £ . Choose \ n such that A" 1 = O p {n l ^ 2l+l ^) and X n = 
Op(™~ i/(2m) ). Then it holds that 

J lfjr(z) - f 3k {zf PXk {z)dz = O p (n-^% 



I 



D l Jf k EN (zfdz = O p (l) 



Jjk u " 6 — 

for all k £ < j < d. 

We now discuss sieve estimation of m. We will do this for spline sieves. 
Denote by Q njC the space of all spline functions of order I with knot points 0, 
L~ ,2L~ , . . . , 1 and with Zth derivative absolutely bounded by c. The spline 
sieve estimator (a IEVE • /^J EVE : 1 < j < d,k € Ij) minimizes 

n r / d d \ ^ 2 

(2.6) n^Y. \Yi-g- 1 5>j*j+ E^ X J X *+E X ;E^™ 

i=l I \j=l j<k j=l fcG/j / J 

j,kec 
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over all functions fjk in Q n c . Again, the functions ff k IEYE are chosen so 
that (1.3) holds. We get the following result for the rate of convergence of 
the sieve estimators /jf ■ 

Corollary 2. Suppose that all assumptions of Theorem 1 hold, that the 
link function g has an absolutely bounded derivative and that the estimators 
/ffc EVE °f fjk are defined by (2.6). Suppose that, for k G Ij, 1 < j < d, the 
functions fjk have derivatives of order I that are absolutely bounded by c. 
Furthermore, assume that -B|e| 7 < oo holds for some 7 > 2 + l~ l . Choose L n 
such that L- 1 = 0(n _1 /(a+ 1 )) and L n = 0(n x /( 2 ' +1 )). Then it holds that 

J lfT YE (z) - f jk (z)] 2 p Xk (z) dz = O p {n-*l^)) 
for all k € < j < d. 

Both results, Corollaries 1 and 2, can be generalized to quasi- likelihood 
estimation. Then, the estimators are defined as in (2.5) or (2.6), respectively, 
but with the squared error (y — <? _1 (u)) 2 replaced by Q(g~ 1 (u),y). For the 
definition of Q, see the next section. One can show that the results of Corol- 
laries 1 and 2 still hold under the additional assumptions (Al) and (A2) 
of Mammen and van de Geer (1997). This can be proved along the lines of 
arguments of the latter paper. In the subsequent sections, we discuss kernel 
estimation of the model (1.1). 

3. Estimation based on kernel smoothing. We will introduce a kernel 
estimator based on backfitting and develop a complete asymptotic theory 
for this estimator. Again, we will do this for the case where i.i.d. observations 
(X*, Y l ) are made on the random vector (X, Y). Model (1.1) can be rewritten 
as (1.4) with constraints (1.3). It can be shown that the parameters aj 
and otjk in (1.4) can be estimated at a faster rate than the nonparametric 
functions fjf.. For this reason we neglect the parametric parts for simplicity 
of presentation. Thus we consider model (1.1) with the constraints (1.3). In 
this setting, our alternative representation (2.2) becomes 

(3.1) g(m(x)) =xjfx(x r+1 ) H hxjfp(x r+p ). 

Let Q be the quasi-likelihood function such that dQ(fj>,y)/dfj> = (y — //)/ 
V(/x), where V is a function for modeling the conditional variance v(x) = 
var(y|X = x) by v(x) = V(m(pc)). The quasi-likelihood for the mean re- 
gression function m is then given by X^iLi Q( m (X 4 ), Y l ), and taking into 
account the structure of the model (3.1) the quasi-likelihood for f\. is 

n 

(3.2) QGT WW+i) + • • • + f p (x; +p )), n. 

i=l 
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We take the smooth backfitting approach [Mammen, Linton and Nielsen 
(1999), Yu, Park and Mammen (2008), Lee, Mammen and Park (2010, 
2012)]. We maximize the integrated kernel- weighted quasi-likelihood 

„ n 

lq(i)= I n- i yjQ(rW2i) T x;+-'-+v%) T xy> yi ) 

(3-3) 

x K h (X. c '\z) dz 

over the tuple of functions (r/ 1 , . . . ,rj p ), each rj k being a vector of univariate 
functions that satisfy the constraints of (1.3), where X c ' 1 = (X* +1 , . . . , X* +JJ ) T 
and z = (z\, . . . , z p ) T . Here and throughout the paper, we label the elements 
of a tuple rj in such a way that 

T7(z) = (r 7l (zi) T ,...,r ? (z p ) T ) T , 

(3.4) 

Vk = \Vj,r+k ■r + kelj,l<j<d} 

with r introduced at the beginning of Section 2, and with this representation 
the constraints of (1.3) on the elements of rj are 



(3.5) 



r]ji(u)wi(u) du = 0, r + 1 < I < r + p, 1 < j < d\ 



ur]ji(u)wi(u) du = 0, r + l<j,l<d. 

We take a kernel such that 

(3.6) J Khj(u,v) dv = 1 for all values of u. 

This kernel can be constructed from the standard kernel of the form Kj lj [u — 
v) = hJ 1 K((u — v)/hj), where K is a symmetric nonnegative function, in 
such a way that Kh.(u,v) = K^.iu — v)/ j K^.iu — w)dw. It was used in 
Mammen, Linton and Nielsen (1999), Yu, Park and Mammen (2008) and 
Lee, Mammen and Park (2010, 2012), and will be used in our technical 
arguments. 

The smooth backfitting estimators of in our model (3.1) are ffc which 
maximize Lq at (3.3). In Section 4, we detail an iterative procedure to 
get the estimators. Here, we provide their theoretical properties. We begin 
with some notational definitions. We let px c denote the marginal density of 
X c = (X r+ i, . . . , X r+P ) T . Also, we let pj and pjk be the marginal densities of 
X r+ j and (X r+ j,X r+ k), respectively. Define Q r (u,y) = d r Q(g~ 1 (u),y)/du r 
and W j (z;r/) = (W i i(z;r/),...,W jp (z;T7)), where 

W ifc (z; rj) = -£[Q 2 (X T T7(X C ), m(X))X i X A T|X c = z]p Xc (z) 
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for l<j,k<p, and X = (X7,...,Xj) T . Let 

W(z;77) = (Wi(z;f/) T , . . . , W p (z; rj) T ) T = W(z; V ) T . 

Throughout this paper we write W(z) = W(z;f), where f is the true tuple 
of the coefficient functions. With slight abuse of notation, we also write 

W jk (zj,z k ;ri) = -^[Q2(X T r 7 (X c ),m(X))X i X^|X r+j = Zj ,X r+k = z k ] 

X p jk (Zj,Z k ), 

W, ; ;:,://! = -E[Q 2 (± T r,(K c ),m(X.))± j X k r \X r+j = z^zj). 
It follows that 

w jj( z j>v) = J ^jj(^,v) dz ^j,^ w jk(zj,z k ;rj) = J W jk (z; 77) dz_ {j>k ^ 

W jj( Z j) = J W ii( z ) dz -3 . Wj k (Zj,Z k ) = J Wj k (z)dz_(j >k y 

Here and throughout the paper, z_j for a given vector z denotes the vector 
without its j'th entry, and z_ij k \ without its jth and kth. entries. Due to 
the conditions (AO) and (Al), Wjj(zj) is positive definite for all Zj G [0, 1]. 
However, W(z) may not be positive definite. 

The relevant space of functions we are dealing with is the one that con- 
sists of tuples 77 of univariate functions with the representation at (3.4) such 
that its elements satisfy the constraints (3.5). For any two functions rj^ 
and j-/ 2 - 1 of this type, define (rj^, tj^)# = / t/ 1 ) t (z)W(z)t7( 2 )(z) dz when- 
ever it exists. We denote by H(W) the resulting space of tuples rj. The space 
is equipped with the inner product (•,•)#■ Let || • ||# be its induced norm, 
that is, ||77|| \ = J T/(z) T W(z)r/(z) dz. 

Define pjx(x) = <9px (x) /dx r+ j . Likewise, define (x) = 9m(x) / dx r +j 

and rrij (x) = 9 2 m(x)/(9x 2 +J -. Let A.j k denote a dj -vector such that its 
Ith element Aj k ,e = 1 if the £th element of Xj in our rearrangement (2.1) 
equals x r+k and Aj k> i = otherwise. Define a (d\ + • • • + d p )-vector A k by 
A^T = (Aj fc , . . . , Aj fc ). In the assumptions given in the Appendix, we assume 
n l '^hj — > Cj as n — > 00 for some constants < cj < 00. For such constants, 
define 

p r 
P 3 (z j ) = W n (z j y 1 p J (z J )Y,4Hb jk (X)\X r+J = z J ) / t 2 K(t)dt, 

k=l J 

where hj k (X.) are dj- vectors given by 



b ifc (X)=(mi 1) (X 



A^f(X c ) 
ff '(m(X)) 
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x, 



V(m(K))g'(m(X)) p x (X) 

Twv,/ ^'(m(X)) 



X,A,lf(X c 



+ 



5"(m(X)) 



y(m(X)) 2 5 '(m(X)) 2 F(m(X))c/'(m(X)) 3 



+ 



+ 



1 



2 F(m(X)V(m(X)) 



4 2) (x) + 



y(m(X)) 5 '(m(X)) 

^(m(X))(A^f(X c )) 2 
5 '(m(X)) 3 



Let /3*(z) = (/3*j(zj) : 1 < j < p) be a solution of 



(3.7) 



Mi' 



l < i < p, 



and put Pj{zj) to be the normalized versions of (3 jf j(zj) so that they satisfy 
the constraints (3.5). Below in a theorem we show that /3(z) = (/3 1 (zi) T , . . . , 
/3 p (zp) T ) T is the asymptotic bias of the smooth backfitting estimator f(z) = 

For the special case where the matrix W(z) is invertible, one has an 
interpretation of f3(z) as the projection of the asymptotic bias of the full- 
dimensional local quasi-likelihood estimator that maximizes 



n 



i=l 



One can check that its asymptotic bias, /3 m i t (z), is given by 

P r 
/3 mlt (z) = W(z)-W(z)E c i^( b fc( X )l XC = z ) / t 2 K(t)dt, 

where b fc (X) T = (b lfc (X) T , . . . ,b pfc (X) T ). 

The tuple j3 m \ t does not belong to H(W). The asymptotic bias (3(z) of 
the smooth backfitting estimator f(z) is identical to the projection of (3 m \ t 
onto H(W), that is, (3 = argmin^g^cvv) WPmit ~ v\\#- This projection inter- 
pretation of P(z) is not available if W(z) is not invertible. In general, one 
has to define f3(z) through the integral equations (3.7). 

For the asymptotic variance of f , we define 



CjPj(zj) 



K 2 {t)dt 



E 



X i X J 



V(m(X))g'(m(X)) 2 



X. 



r+j 
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X E 



w(X)X i XT 



y(m(X))V(m(X)) 



E\ 1 3 



X r 



-1 



y(m(X))g / (m(X)) 2 
where u (x) is the conditional variance of Y given X = x. 

Theorem 2. Under (AO) in Section 2 and those (Al)-(A5) in the 
Appendix, there exists a unique maximizer f of the integrated kernel-weighted 
quasi-likelihood (3.3) with probability tending to one. The maximizer f sat- 
isfies 

||f(z)-f(z)| 2 pz(z)dz = O p (n- 4 / 5 ), 

sup \ij(zj) - fj(zj)\ = Op(n~ 2/5 v / logn), 
Zje[2hj,l-2hj] 

where I • I denote the Euclidean norm. 



Theorem 3. Assume that (AO) in Section 2 and those (Al)-(A5) in 
the Appendix hold. Then, for all z in the interior of the support of px c > it 
follows that n 2//5 (fj(zo) — fj(zj)) are jointly asymptotically normal with mean 
((3 1 (zi) T , . . . , f3 p (z p ) ) T and variance di&g(£j(zj)). 

In the special case where m(x) = x\fi{x p+ i) + • • • + x p fd(x2 P ) for x = 
(x\, . . . ,X2 P ) T , thus the link g is the identity function and Q([i,y) = —(y — 
/i) 2 /2, the asymptotic bias and variance of ij(zj) stated in Theorem 3 coin- 
cide with those in Theorem 2 of Lee, Mammen and Park (2012). This can be 
seen by noting that W(z) is invertible in this case, and that V = g' = 1, X = 
(Xi, . . . ,X p ) T , X c = (X p+1 , . . .,X 2p ) T and W(z)- 1 ,B(XX j |X c = z)p X c(z) = 
lj where lj is the p-dimensional unit vector with the jth entry being equal 
to one. 

Theorem 3 can be also viewed as an extension of Theorem 2 of Yu, Park 
and Mammen (2008). In the latter work, smooth backfitting for the additive 
model g(m(z)) = fi(z\) + ■ ■ ■ + f p (z p ) for a link g was considered. As we 
mentioned earlier, model (3.1) reduces to the above model by taking = 1, 
Xfc = Xk = 1 for 1 < k < p and r = p. In this case, W(z) is not invertible so 
that the projection interpretation of /3(z) is not valid. If one replaces m(x) in 

the formula of ^j(zj) by 3 _1 (/(x c )) = g~ 1 {f 1 {x r+1 ) H \-f p (x r+p )), W jk 

by the corresponding quantities for the latter model, which are 



W jk ( Zj ,z k ) = J F( 5 - 1 (/(z)))- 1 5 / ( 5 - 1 (/(z)))- 2 p X c(z)dz_ (i)fe) 
W jj (z j ) = j V(g-\f{z))y 1 g\g-\f(z)))- 2 p^)dz- j , 
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then one can verify that the solution of the system of integral equations 
in (3.7) concides with the asymptotic bias in Yu, Park and Mammen (2008). 
The asymptotic variance 'Sj(zj) given above also reduces to the one in Yu, 
Park and Mammen (2008). 

Remark 1. In the case where the link g in model (3.1) is the iden- 
tity (or a linear) function and when the covariates Xj are independent, 
one may apply marginally a kernel smoothing method to estimating each 
coefficient function. To see this, suppose that all Xj contains 1 as their 
first entry and any entry of Xj does not equal to any of X r+ k, k 7^ j. 
Then, E(Y\'X.j, X r+ j) = Xj[fj(X r+ j) + cj] for some constant vector Cj. This 
means that fj minimizes E[Y — X.Jr]j(X r+ i)] 2 over r/j subject to a nor- 
malization. Thus, the marginal smoothing that minimizes n~ l X^I=i(^ 1 — 
r)J~K'j) 2 Kh j (X* + j, Zj) for each j and each point Zj gives a consistent esti- 
mator of ij(zj). This marginal smoothing approach is not valid, even with 
independent covariates, in case the link function g is nonlinear. In the latter 
case, one needs to use a projection method such as the smooth backfitting 
defined above, or a marginal integration technique, to obtain appropriate 
estimators. 

4. Implementation. In this section, we discuss how to find f maximiz- 
ing Lq at (3.3). Our method of finding f is based on an iteration scheme. 
By considering the Frechet differentials of Lq, we see that 

« n 

I n^ 1 ^Q 1 (X iT f(z),Y i )X;.K h (X c ' i ,z)dz„ J = 0j, 

(4.1) 

l<j<p,Zj€[0, 1], 

where Oj is the zero vector of dimension dj. The system of equations is 
nonlinear. We take the Newton-Raphson approach to find a solution by it- 
eration. For a vector of functions (r]J, . . . ,r]J) T where Tfj( z ) = Vji z j)i define 

/n 
i=i 

The system of equations in (4.1) is then expressed as Fj(i) = Oj, 1 <j <p- 
Our algorithm runs an outer iteration which is based on a Newton-Raphson 
approximation of the system of equations. Each outer-step solves a linearized 
system of equations to update the approximation of f , which requires an 
additional iteration, called inner iteration. 

To describe the algorithm, suppose that we are at the sth outer-step to 
update f I s " 1 ] in the previous outer-step. Considering the Frechet differentials 
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of Fj at f[ s_1 ], we get the following approximation: for 1 < j < p, 
Fj(f)( Zj ) ^F^-V)^) 

V » n 

(4.3) +Y, / n- 1 ^Q 2 (X iT f[ s - 1 l(z),Y i ) 

k=l J i=l 

x X^ T [f fe (^)-fJ s " 1] (^)]K h (X c - i ,z)dz_ i . 

This gives an updating equation for fW. Define, for 1 < j <p and for 1 < 

j + k < p, 

r. n 

Wg(^,^) = - / n- 1 ^Q 2 (X iT f[ s ]( Z ),F i )X}Xt T J Pr h (X c ' i ) z)dz_ ( ,- fc) , 

t=l 

w§ (*,•) = -J n ~ 1 Y, Q2(x tT f w ( Z ), r )x;xf^ h (x c ' 4 , z) dz-j. 



Define 5^) = [wg(^]~%(f [s] )(%)- Also, let sf( Zj ) = fj a] ( 

f| s ^(-Sj). We get from (4.3) the following linearized system of updating 
equations: 

s;* l fe)=^- 1| fe)-yj Awi*- 1| ( 2j )]- i wi* t - 1| ( % -,z t )iL* 1 («)d Zi , 

(44) 

i < j <p- 

Solving the system of equations (4.4) for Sj ' and then updating fj s ^ by 

fj^ = f j s ^ + 5j ' constitutes the sth step in the outer iteration. Below in 
Theorem 4, we show that the outer iteration converges to a solution that 
satisfies (4.1). 

The system of equations (4.4) cannot be solved since to get <5 ? - one re- 

quires knowledge of the other d k ,k^j. To solve (4.4) we need an inner 
iteration. Suppose that we are at the £th inner-step of the sth outer-step to 

- [8,^—1] 

update Sj' , 1 < j < p, in the previous inner iteration step. We apply (4.4): 

xS k (z k )dz k , l<j<p- 

Existence of a unique solution of (4.4) and the convergence of the inner 
iteration to the solution are demonstrated below in Theorem 4. For the 
starting values Sj ' in the inner iteration of the sth outer-step, one may use 

i- ■ . . ^[s— l.oo] 

the limit of the inner iteration m the previous outer-step dj 
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For a convergence criterion of the outer iteration, one may check whether 
the values of the left-hand side of (4.1) are sufficiently small, or use the 

difference between the two updates ff 8 1 ' and fjf': 
(4.6) f\^\z k )-t [ °- l \z k )\ 2 dz k . 

In the latter case, one should use the normalized versions of the updates. 
Recall the configuration of rj k in (3.4). The normalized version of a given 
set of tuples rj^ may be obtained by the following formulas. Let the weight 
functions wi be normalized so that J wi(u) du= 1. Then 



rjji(u) =rjji*(u) 



(4.7) 



rj*ji(u)wi(u) du, 



l<j<rord+l<l<r+p, 



Vjl(u) = Vjl*( u ) ~ a jl ~ b jl u > r+l<j,l<d, 



where 



a,ji = J T)*ji(u)wi(u) du-bji J uwi(u)du, 



u — / twi(t)dt\ wi(u)du 



-1 



x j \u — j twi(t) dt\ri*ji(u)wi(u) du. 



One should also use the normalized Sji for the convergence of the inner 
iteration. 

Theorem 4. Assume that (AO) in Section 2 and (Al)-(A5) in the 
Appendix hold. Then there exist constants < C\ , r < 00 and < 7 < 1 such 
that, if the initial choice f ^ satisfies 



(4.8) 



||P(z) 



f(z)|W(z)dz<r 2 



with probability tending to one, then J |fM(z) — f (z)| 2 px c ( z ) dz < Ci4 _( - s_1 - ) 
7 2 _1 with probability tending to one. Also, for each outer-step there exists 
a solution of the system of equations (4-4) that is unique, and the inner it- 
eration converges at a geometric rate. Furthermore, if the initial choice f ^ 
satisfies (4-8) with probability tending to one, then there exist some con- 
stants < C2 < 00 and < p < 1 such that, with probability tending to one, 

J '^(z) — S ' \z)\ 2 px<={z) dz < C2P 2i for sufficiently large s, where ^ s '°°' 
is a solution of the system of equations in (4-4)- 
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The theorem shows that the number of iterations that is needed for a de- 
sired accuracy of the calculation of the backfltting estimator does not depend 
on the sample size. If the desired accuracy is of order n~ c for some constant c, 
then a logarithmic number of iterations suffices. Thus the complexity of the 
algorithm only increases very moderately for increasing sample size. We have 
no good bound on the required accuracy of the starting values, that is, on 
the choice of r. In our practical experience the algorithm was very robust 
against poor choices of the starting value. In fact, in the simulation study 
we chose f = and it worked quite well. A more deliberate choice is a ver- 
sion of the marginal integration estimator studied by Yang et al. (2006), 
or a spline estimator that we discussed in Section 2. These are consistent 
so that they satisfy the condition (4.8), but they cost additional numerical 
computation. 

As for the choices of the bandwidths hj, one may estimate the optimal 
bandwidths h° pt = c*n~ 1 ^ , where c* = (c*, . . . , c*) is defined by 

P r 

(4.9) c* = argmin^ / [\Pj(zj, c)| 2 + trace(£,(zj, Cj))]p Zj (zj) dzj. 

Here, we write f3j(zj,c) and Hj(zj,Cj), instead of (3j(zj) and Hj(zj) as 
defined in Section 3, to stress their dependence on the vector of the band- 
width constants c = (ci, . . . , c p ). To describe a simple plug-in method, get 
parametric estimates of fjf. by maximizing (3.2) over the class of pth order 

polynomials fjk( x ) = a fk + a jk + ' — l~ a p! xP > ana - obtain a kernel estimate 
of the density px- Then one can estimate 0j by plugging these estimates into 

the formulas of j3j,Wjj,Wjk (j / k) and solving the system of equations 
in (3.7) by iteration. One can also estimate Sy. Put these estimates of /3 • 
and 5L into the right-hand side of (4.9) to get an estimate of c*. A sim- 
ilar idea was adopted by Lee, Mammen and Park (2012) for the classical 
varying coefficient model. An alternative way is to develop a method simi- 
lar to the penalized least squares bandwidth selector proposed by Mammen 
and Park (2005). This would need higher-order stochastic expansions for 
the quasi- likelihood of the smooth backfltting estimators. Finally, we want 
to mention that f3j depends on the whole vector c contrary to , the latter 
only involving Cj. This is not the case with the local linear smooth backflt- 
ting where both depend on Cj only, see the next section. Thus, a grid search 
for c* at (4.9) may be computationally expensive for large p. In this case, 
one may apply an iteration scheme which, in each iteration step, updates c,- 
for 1 < j < p one by one with the other c/j , k ^ j , being held fixed at the 
values obtained in the previous step. 

5. Extension to higher-order local smoothing. In the previous two sec- 
tions we considered smooth backfltting based on Nadaraya- Watson smooth- 
ing. Here, we discuss its extension to local polynomial smoothing. We focus 
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on the local linear case. The extension to the general case is immediate, but 
needs more involved notation. For a function rj of interest, the basic idea 
of local linear smoothing is to approximate rj(u) for u near a point z by 
rj(z) + 7]'(z)(u — x), where rf is the first derivative of rj. Thus, we maximize 

J i=l 

x K h (X. c >\z)dz, 

where r]j{zj,X % r+ -) = r) j0 (zj) + r)j l (zj)Kj 1 {X' l r+ - - Zj). The maximizers, de- 
noted by fjo and fji which correspond to rjj and rjji, respectively, are the 
estimators of the true fj- and hjfj, where fj is the vector of the first deriva- 
tives of the entries in fj. Again, fjo should be normalized according to (4.7). 

To describe the algorithms, write fj = (f^,f^) T . They satisfy Fj(fi, . . . , 
f„) = Oj, 1 < j < p, where Oj denotes now the zero vector of dimension 2dj, 

F j ( Vl ,...,r 1p ) = J n-^Qi^Xf^^-,^.),^ 

x a(X* +i , Z j) ® X;-K h (X c ' 1 , z) dz_j 

rjj = (tjJ , vji) T an d 8L(X* + j,Zj) = (1, hj 1 - Zj)) T . The expressions for 

the updating equations at (4.4) and (4.5) are unchanged if, writing A* fc = 
&(X l r+j , Zj)a.(X l r+j , z k ) T , we redefine W^j, l<j,k<p, by 

7 i=i \j=i ) 

x A;. fc ^x;.x^ T ^ h (x c ' i ,z) ( iz_ . fc) , 

7 i=i \j=i / 

x A)j <g> X;xf K h (X c ^, z) £&_,-. 

Let Sj be defined as in Section 3, and define (3j L (zj) to be the normalized 
versions of cj f t 2 K(t)dt^'(zj)/2 obtained by (4.7), where fj' is the vector 
of the second derivatives of the entries in ij . 

Theorem 5. Under (AO) in Section 2 and (A1)-(A5) in the Appendix, 
Theorems 2 and 4 remain valid for the local linear smooth backfitting esti- 
mators (fjo,fji) and for their algorithms, respectively. As a version of The- 
orem 3, n 2 ^ 5 (fjo(zj) — fj(zj)) are jointly asymptotically normal with mean 
(Pf L (zi) T , . . .,(3p L (z p ) T ) T and variance diag(Sj-(%)). 
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6. Simulation study. In the simulation study, we considered a binary 
response Y taking values and 1, and took the following model for the 
mean function m(x): 

g(m(X)) = f 02 (X 2 ) + f 03 (X 3 ) + Xx(f 12 (X 2 ) + f 13 (X 3 )) 

(6.1) 

+ ^3/32(^2) + -^2/23(^3)5 

where g(u) = log(u/(l — u)) is the logit link and fo 2 (z) = z 2 ,fo 3 (z) = 4(z — 
0.5) 2 ,/i 2 (z) = z,f l3 (z) = cos(27r z),f 32 (z) = e 2z -\f 23 (z) = sin(27rz). The co- 
variate X\ was a discrete random variable having Bernoulli(0.5) distribution, 
and X 2 and X 3 were uniform(0, 1) random variables. The three covariates 
were independent. We chose two sample sizes n = 500 and 1000. The number 
of samples was 500. For the initial estimate, we used f ^ = 0. The weight 
functions wi were W[{z) = I\q : i](z) for all I. We used the Epanechnikov kernel 
function K(u) = (3/4) (1 — u 2 )I^i^(u) and took the theoretically optimal 
bandwidths as defined at (4.9), which were /i° pt = 0.4328, h° 2 vt = 0.2789 for 
n = 500, and /i° pt = 0.3768, h 2 pt = 0.2428 for n = 1000, in our simulation 
setting. 

In the simulation, we also computed the cubic spline estimates with K 
knots placed evenly on the interval [0,1]. We used the power basis for cu- 
bic splines: s (z) = l,st(z) = z,s 2 (z) = z 2 ,s 3 (z) = z 3 ,s 3+k (z) = (z - i k )\, 
where are the knot points. If one applies directly the power basis to the 
model (6.1), one may suffer from "near singularity" of the resulting design 
matrix. This is because the functions fjk without satisfying our constraints 
are not identifiable. Taking into consideration the constraints, we adjusted 
the power basis so that s\ is orthogonal to so, and Sj for 2 < j < K + 3, 
are orthogonal to sq and s\. The dimension of the power basis for the cu- 
bic spline approximation of the model (6.1) equals 6K + 19. The number of 
knots taken was K = 1 which gave the best performance. The performance 
of the spline estimators got worse quickly as K increased. 

Table 1 shows the results based on 500 datasets. For each component 
function fjk, the table provides the integrated mean squared error (IMSE), 
J E[fjk(z) — fjk(z)] 2 dz. The main lesson is that the spline estimators have 
much larger variances than the smooth backfitting estimators, while the 
former have smaller biases. Overall, the smooth backfitting method works 
quite well. Comparing the values of IMSEs for the two sample sizes, the 
results for the smooth backfitting method reflect the asymptotic effects fairly 
well. Note that the theoretical reduction of IMSE from n = 500 to n = 
1000 equals (0.5) 4//5 ~ 0.574. In the simulation we also found the iterative 
algorithm of the smooth backfitting method in Section 4 converged very fast. 
The outer loop typically converged in five iterations with the criterion value 
10 -4 for the normalized difference (4.6), and that the inner loop converged 
in three iterations. 
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Table 1 

Integrated mean squared errors (IMSE), integrated squared biases (ISB) and 
integrated variance (IV) of the two methods, cubic spline (SPL) and 
smooth backfitting (SBF), for the model (6.1) 











fo2 


/l2 


/32 


/03 


/is 


/23 


SPL 


n = 


= 500 


IMSE 
ISB 
IV 


0.2607 
0.0013 
0.2594 


0.2145 
0.0004 
0.2141 


0.5034 
0.0011 
0.5022 


0.2631 
0.0006 
0.2624 


0.2274 
0.0044 
0.2230 


0.5857 
0.0159 
0.5699 




n — 


1000 


IMSE 
ISB 
IV 


0.1106 
0.0001 
0.1105 


0.0817 
0.0004 
0.0813 


0.2122 
0.0001 
0.2121 


0.1074 
0.0001 
0.1073 


0.0938 
0.0006 
0.0932 


0.2453 
0.0104 
0.2349 


SBF 


n = 


= 500 


IMSE 
ISB 
IV 


0.0315 
0.0035 
0.0280 


0.0399 
0.0112 
0.0288 


0.0274 
0.0128 
0.0147 


0.1071 
0.0090 
0.0981 


0.1073 
0.0543 
0.0531 


0.1685 
0.0808 
0.0877 




n — 


1000 


IMSE 
ISB 
IV 


0.0214 
0.0021 
0.0193 


0.0210 
0.0057 
0.0153 


0.0254 
0.0107 
0.0147 


0.0526 
0.0062 
0.0464 


0.0702 
0.0384 
0.0318 


0.1103 
0.0544 
0.0559 



We also investigated how the additional terms in the modeling (1.1) af- 
fected the estimation precision when the true model was given by 

#(m(x,z)) = xifi(zi) H h x d f d (z d ) 

for a set of covariates (X±, . . . , Xj; Z\, . . . , Zd). In the latter model, each co- 
variate appears in only one nonlinear interaction term. For this, we estimated 
the following model: 

g(m(X, Z)) = /oi(Zi) + f 02 (Z 2 ) + X 1 {f 11 {Z 1 ) + f 12 (Z 2 )) 

(6.2) 

+ X 2 (f 2l (Z l ) + f 22 (Z 2 )), 

where f 01 (z) = f 02 (z) = 0,fu(z) = cos(2vrz), f 12 (z) = 0,f 21 (z) = 0J 22 (z) = 
sin(2-7rz), and the link g was the same as in the first example. The covari- 
ate X\ was a discrete random variable having Bernoulli(0.5) distribution, X 2 
was the standard normal random variable and Z\ and Z 2 were uniform(0, 1) 
random variables. The four covariates were independent. The theoretically 
optimal bandwidths as defined at (4.9) were h° pt = 0.2405, h° pt = 0.2469 for 
n = 500 and /i° pt = 0.2093, h° 2 pt = 0.2149 for n = 1000, and we used these in 
the simulation. 

The main purpose of this additional simulation is to compare our estima- 
tors based on the working model (6.2) with the "oracle" estimators which 
use the knowledge that foi(z) = fo 2 (z) = fi 2 (z) = fei(z) = 0. The system of 
updating equations for the oracle estimators in our setting is given by (4.4) 
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Table 2 

Comparison of the smooth backfitting estimators under the extended model (6.2) and the 

corresponding oracle estimators 



fll f22 



Oracle 


n = 500 


IMSE 


0.0680 


0.0639 


SBF 




ISB 


0.0285 


0.0421 






IV 


0.0395 


0.0218 




n = 1000 


IMSE 


0.0400 


0.0433 






ISB 


0.0183 


0.0309 






IV 


0.0216 


0.0124 


SBF 


n = 500 


IMSE 


0.1057 


0.0638 


based on (6.2) 




ISB 


0.0273 


0.0408 






IV 


0.0784 


0.0230 




n = 1000 


IMSE 


0.0627 


0.0427 






ISB 


0.0180 


0.0299 






IV 


0.0447 


0.0128 



with the following modifications of Wj k (zj, Z/.) and Fj(rj)(zj): for j ^k, 

n 

W^{z 3 ,z k ) = -n- 1 Y,Q2{Xlf^(z l ) + Xlf^ 

i=l 

Wf}{ zj ) = - [ n- x Y j Q 2 {X\f^{z x ) + Xif^{z 2 ),Y i ) 
J i=i 

x (Xj) 2 K h (Z 4 ,z)dz„ i , 

/n 
n- 1 Y,Qi(Xlvii(z 1 ) + X t 2m2 (z 2 ),Y i )X l j K h (Z l ,z)dz^, 
i=i 

where Z* = (Zf,Z£) T , z = (zi,z 2 ) T and rj(z) = (j?h(^i),?/22(-2 ; 2)) T - Note that 
all these terms are a scalar, not a matrix or a vector. 

Table 2 shows the results based on 500 datasets. For each of the nonzero 
component functions, the table provides ISB, IV and IMSE. We see that the 
smooth backfitting estimators perform fairly well in comparison with their 
oracle versions. In particular, both have nearly the same IMSE, ISB and IV 
for the estimation of the second component function /22- For estimating fn, 
the smooth backfitting procedure with the extended model (6.2) gave almost 
the same bias as the oracle procedure, but a larger variance than the lat- 
ter. This may be expected since the former has the additional component 
function f± 2 in the estimation. This was not the case with the estimation 
of /22j however. The main reason is that the variances of the estimators 
depend highly on the design of the regressor X 2 . Recall that in parametric 
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linear regression the variance of the least squares estimator of a regression 
coefficient gets smaller as the corresponding regressor is more variable. In 
our setting, the variability of X2 is four times as high as that of X\. This 
relatively high variability of X2 alleviated the extra sampling variability of 
the SBF estimator under the model (6.2). 

APPENDIX: TECHNICAL DETAILS 

A.l. Proof of Theorem 1. The statement of Theorem 1 follows immedi- 
ately from the following lemma. 

Lemma 1. Under assumption (AO), it holds that there exist constants 

< C\ < C2 such that for two tuples (a, fjf. : 1 < j < d, k £ Ij) and (a* ,f* k : 

1 < j <• d,k € Ij) it holds that 

d [lg(m(x))-g( m *(x))] 2 P x (dx) 



(A.l) < I a - a*\l + Yl ifjk( x k) ~ fjk( x k)] 2 Px k (x k ) dx k 



j=i kei 



<C 2 f [<?(m(x)) - 5 (m*(x))] 2 P x (dx) 
Here, px k is the density of X k , and 



led 2 . 



Vj=i j<k ) 



j,k&Co 

d 

(A.2) #(m(x)) = Y a i x j + X] a jk x i x k + x 1 fi(x r+ i) H h f p (x r+p ) , 

j=l j<k 



d 

#(m*(x)) = Y a j x j + 5^ ot* jk XjX k + ScJ{i(x r+ i)-\ \-k p v f*(x r+Pj 



j=l j<k 

j,kec 

Proof. We only prove the second inequality of (A.l). The first one fol- 
lows by direct arguments. We first observe that because of assumption (AO) 
it holds that for constants C\,C2 > 0, 

[ 5 (m(x))- 5 (m*(x))] 2 P x (<ix) 

D 

> Cl / b(m(x)) -sKfx))] 2 ]]?!,^) 
J 1=1 
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>c 2 [g(m(x)) - g(m*(x.))] 2 Y[wi(xi)dxiY[Pxi(dxi). 

lec lie 

Denote by I the right-hand side of the second inequality. Due to the con- 
straints of (1.3) and the fact that x& does not include x r+k , those terms 

d 

^2(a>j - a*)xj + ^2 ( a jk- a* jk )xjX k , 

j=l j<k 

j,kec 

xj(fi(x r+1 ) - f*(x r+ i)), . . . ,xj(fp(av +p ) - f* (x r+p )) 
are orthogonal in L2 (/•*)) where \x is the product measure defined by //(dx) = 
rijec w j( x j) d x j Tlj<£c (dxj). By this and by making use of (AO) again, 
we get 



C2 



y](°tj - a*)xj + (a jk ~ a* k )xjX k 



■3=1 



j<k 

j,k£C 



~[ wi(xi) dxi P Xl (dxi) 
lec lie 



+ C2 ^2 / ( ¥ k( X r+k) - ^k{ x r+k))fWwi{xi)dxiY\P Xl {dxi) 



k=l 

\ I * 1 2 

> c 3 |o; - a|, 



zee 



+ C3 X] / ( f fc( x ^+fc) - f fc( 2; r+fc))] 2 J Px r+fe (d2;r+fc) JJPjr,(da;i) 
k=i J lei* 

for some constants C3 > and where It denotes the set of indices of x& . The 
second inequality of (A.l) now follows because the smallest eigenvalues of 
f itkxj Yiiei* Pxt(dxi) can be bounded from below by a positive constant 

times the smallest eigenvalue of E^XfcXjT] = f x^x^P-j^ (dx^), where P^ 

denotes the distribution of X^. These eigenvalues can be bounded away 
from zero by assumption (AO). □ 

A. 2. Proof of Corollaries 1 and 2. For the proof of these two corollaries, 
we apply Theorem 1. We have to show that (2.4) holds with n n = n l ^ 2l+1 ' 
for the penalized least squares estimator and the spline sieve estimator, 
respectively. 

For the proof of Corollary 1, we apply Theorem 10.2 in van de Geer (2000). 
As discussed in van de Geer (2000) the statement of the theorem remains 
valid for errors with sub exponential tails if the entropy bounds hold for en- 
tropies with bracketing. For the application of this theorem one needs results 
on the entropy with bracketing for the class of functions m that fulfill (1.4) 
with fjk in Sobolev classes. Because g has an absolutely bounded derivative, 
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Lemma 1 implies that the well-known entropy conditions for Sobolev classes 
carry over to the classes of functions m. This proves Corollary 1. 

For the proof of Corollary 2 we use Theorem 1 in Chen and Shen (1998). 
Compare also Theorem 10.11 in van de Geer (2000). Using the above entropy 
bound one can easily verify Conditions A.1-A.4 in Chen and Shen (1998) 
with 1(9, (X,Y)) = (m(X) - Y) 2 , 9 = (a,f jk ;l <j<d,k£ Ij) and m as 
given at (A.2). Note that 1(9, (X,Y)) - l(9 , (X,Y)) = (m(X) - m (X)) 2 - 
2(m(X) — mo(X))e, where 9q = (&o, fojkj 1 < 3 < d, k £ Ij) is the true tuple, 
mo denotes the true underlying regression function and e = Y — too(X). 
To check the conditions compare also the proof of Proposition 1 in the 
latter paper. In particular, their condition A. 4 holds with s = 21/(21 + 1). 
This follows because for two functions gi,g 2 : [0, 1] — > R with \D z g\(z)\ < L, 
\D l z g 2 (z)\ <L and ^(g x (z) - g 2 (z)f dz <8 2 , it holds that \ gi (z) - g 2 (z)\ < 
2(2L) 1 - C 5 1 ~ C with c = 21(21 + l)" 1 ; see Lemma 2 in Chen and Shen (1998). 
The necessary conditions are simplified because we assume that the data are 
i.i.d.; see also Remark 1(b) in Chen and Shen (1998). To get e ( n s)/(7 l) B n > 
1 at the end of their proof of Theorem 3 one needs that (2 — s)/(7 — 1) < s 
which is equivalent to 7 > 2 + I . One can check that their proof goes 
through with this constraint. Thus it suffices for the i.i.d. case that -E|e| 7 < 
00 holds for some 7 > 2 + I . 

A. 3. Additional assumptions for kernel smoothing. We assume the den- 
sity of X c is supported on [0, l] p . Thus, the integration at (3.6) is over [0, 1]. 
We note that, for the normalized kernel Kh.(v,j,Zj) introduced in Section 3, 
[2hj,l — 2hj] is the interior region for Zj that does not have a boundary 
effect. In addition to assumption (AO) in Section 2, we collect the conditions 
we use for the theory in Sections 3 and 5. 

(Al) The quasi-likelihood function Q(fj,,y) is three times continuously 
differentiable with respect to \x for each y in its range, Q 2 (u, y) < for ti£l 
and y in its range, the link function g is three times continuously differen- 
tiable, V is twice continuously differentiable and the conditional variance 
function f(x) = var(Y|X = x) is continuous in x c = (x r +i, . . . , x r + p ) T for 
each (x\, . . . ,x r ). The densities pxj for r + 1 < j < r + p are bounded away 
from zero on [0, 1]. The function V and the derivative g' are bounded away 
from zero. The higher-order derivatives g" and g'" are bounded. The weight 
function w is continuously differentiable and fulfills w(0) = w(l) = 0. 

(A2) The partial derivatives dpx(x)/dx c of the joint density function px 
exist and are continuous in x c for all (x\, . . . ,x r ). 

(A3) The components of fj are twice continuously differentiable. 

(A4) E\Y\ a < 00 for some a > 5/2. 

(A5) The kernel function K is bounded, symmetric about zero, has com- 
pact support, say [—1,1], and is Lipschitz continuous. The bandwidths hj 
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depend on the sample size n and satisfy n 1 / 5 /^- — > c^- as n — > oo for some 
constants < Cj < oo. 

A. 4. Preliminaries for the proofs of theorems 2 4. The population ver- 
sions of Fj are defined by 

Fj(v)(zj) = J S[Qi(X T 7 7 (z),m(X))X J |X c = z]p XC (z)dz_ J . 

For the empirical versions of 'Wjk(z;rj), Wj(z;rj) and W(z;?7) introduced 
in Section 3, we define 

n 

W jfc (z;r ? ) = -n- 1 ^Q 2 (X iT r ? (z),y J )X;.Xl T ^ h (X^,z), 

i=l 

and then define Wj (z; rj) and W(z; rj) in the same way as we define Wj (z;ij) 
and W(z;?7), respectively. We write Wjfc(z) = Wjfc(z;f) in case the true f 
enters into the place of rj. 

For a tuple S £ H(W), let F'^rj)^) denote the Frechet differential of Fj 
at rj to the direction of d. Then 

F' j ( V )(5)(z j ) = -J W j (z; V )S(z)dz^ j . 

The second term on the right-hand side of (4.3) is simply Fj(f[ s_1 l)(f — 
f[ s-1 l)(zj). The population versions of F^ry) are defined by F'j(r])(d)(zj) = 
— j Wj(z; ?7)<5(z) dz_j. Define a linear operator F'(rj) by 

F'( V )(6) = ((F[( V )(S)) T , . . . , (E» (S)) T ) T . 

Likewise, define F'(r/) from F'j(r]). In the proofs below, we use f = (f^, . . . , f^) 
to denote the true vector of univariate functions. 

A. 5. Proof of Theorem 2. In addition to || • ||# introduced in Section 3, 
we consider two other norms. Let || • || 2 be the L 2 (px c )~ norm defined by 
Nil! = / l ? ?( z )| 2 Px-(z) dz. Define [|t|[|oo = ^^{^P2hi<z!<l-2hi l*h(*OI> ■ • • > 
su P2/ip<z p <i-2hp l^pC^p)!}; where | • | denotes the Euclidean norm. As in Sec- 
tion 3, we write W(z) = W(z;f), Wjj(zj) = ~Wjj(zj;f), etc., for the true 
tuple f . For a linear operator T that maps %(W) to Ti(W), let || T\ ^ denote 
its operator-norm defined by H-FHop = sup|| 4 || =1 ||.F(<5)||- Here and below, if 
not specified, || • || is either || • || 2 or || • ||oo. We prove 

(A.3) P(F'(f) is invertible and ||F'(f) -1 || op < C x ) -> 1, 

(A.4) P(||F'(r7) - F'(r/)ll op < C 2 \\ V - rj'\\ for all V , r,' G B r {f)) -> 1, 

for some constants r, Ci,C 2 > 0, where B r (f) is a ball centered on f with 
radius r. Then, the theorem follows from Newton-Kantorovich theorem 
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[see, e.g., Deimling (1985)] since ||F(f)|| 2 = O p (n~ 2 /- J ) and ^(f)^ = 
Op(n~ 2 / 5 v / logn). 

By the standard techniques of kernel smoothing, one can show that, uni- 
formly for z G [0, l] p , W jk (zj,z k ) = / Wj fc (z) dz_ (j]A .) converges to W jk (zj,z k ) 

and Wjj(zj) = J Wjj(z) dz_j to Wjj(zj), for 1 < j ^ k < p. This gives the 
uniform convergence of F' (f)(5) to F' (f)(5) over 5 such that ||5|| < i?, where 
R > is an arbitrary positive real number. Thus, to prove (A. 3) it suffices 
to show that F'(f) is invertible and has a bounded inverse. For this claim 
we first show that the map F'(f) :T-t(W) — > ~H(W) is one-to-one. Suppose 
that F' (f)(5) = for some 5 G H(W). We have to show that 5 = 0. From 
F'(f)(5) = we get that S j {z j ) T J Wj(z)5(z) dz_j = for all l<j<p. This 
implies 



dZj 



= E/ /^(^) T /w,(z)5(z)dz_ 

(xj 5 j (x r+J - ) ) (x J 5 fc (x r +k ) ) -Px (cfec) 



p p 




for some positive constant c > 0. Here, for the inequality we used that 
V(u)g' (u) 2 is bounded from above for u in any compact set. Applying the 
arguments in the proof of Lemma 1, we get that 5 = a.s. This proves the 
claim that the map F'(f) :T~L(W) — > 'H(W) is one-to-one. 

Next, using the fact that (F' (f)(5), r/) # = <5,F'(f)(»7)) # for all S,r] G 
~H(W), one can show that F'(f) is onto. Thus, F'(f) is invertible. To verify 
that F'(f) -1 is bounded, it suffices to prove that the bijective linear operator 
F'(f) is bounded, due to the bounded inverse theorem. From the assump- 
tion (Al) and the fact that support of the density of X is bounded, we 
get 

l|F'(f)(5)|| 2 # = J |F'(f)(5)(z)| W(z) dz < C 3 ||5|||, 

||F'(f)(5)|| 0O <C 4 ||5|| 0O 

for some constants C3, C4 > 0. This concludes that F'(f ) is bounded in both 
of the norms || • H2 and || • ||oo. 

The claim (A. 4) holds since, for any given r > 0, there exists a constant 
C5 > such that, with probability tending to one, ||F'(r/)(5) — F'(t/)(5)|| < 
CsWv-v'W ■ for all rj,rj' G B r (f). 
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A.6. Proof of Theorem 3. Let d denote a solution of the following equa- 
tions: 

(A.5) 

- ^ J ^jj{z j )~ ll W jk (z j ,Zk)8 k (z k )dz k , l<j<p, 

where Sj(zj) = Wjj(zj) _1 Fj(f)(zj). We first remark that S exists and is 
unique, with probability tending to one. Define f = f + d. We claim 

(A.6) ||f-f|| 00 = O p (n- 2 / 5 v / lo^). 

Define Fj for 1 < j < p by 

F j ( V )(z j ) = F j ^)(z j ) 

P , n 

+ E/^ 1 E^(x" T f(z),y J ) 

k=l i=l 

x X_)jC k T l Vk (z k ) - f fc (z fc )]K h (X c >\z)dz„j. 

Note that f is the solution of the system of equations Fj (ij) = Oj , 1 < j < p by 
the definitions of 5j and Sj. Thus, the claim (A.6) ensures that Fj(f)(zj) = 
Fj(i)( Zj ) + o p (n~ 2 / 5 ) = o p (n~ 2 / 5 ), uniformly for Zj £ [2hj,l - 2hj]. Also, 
(A.6) and (A. 4) give (A. 3) with f being replaced by f. This establishes 
||f ~~ f ||oo = o p (n _2//5 ). The theorem follows since 

n 2 / 5 S(z) A iV((^(^) T , . . . , (3 p (z p ) T ) T , diag(£j(*j))), 

the latter being proved similarly as in the proof of Theorem 2 of Lee, Mam- 
men and Park (2012). 

It remains to prove (A.6). The Frechet differential of Fj at rj to the 
direction S, which we denote by F'j(rj)(S), does not depend on r\ since F j{r\) 
is linear in r). In fact F'j(ri)(S) = F^(f)(<5) for all rj. This means that F(f) = 

F(f) and F'(rj) = F'(f ) for all r], so that (A.3) and (A.4) are valid for F. As 
in the proof of Theorem 2, this implies (A.6). 

A. 7. Proof of Theorem 4. An application of Newton-Kantorovich theo- 
rem gives the first part of the theorem. For the proof of the second part of 

- [s,£] 

the theorem, we rewrite a full cycle of the iteration step in (4.5) as S ' = 
S^ + ' + A[ s-1 ]<^ ' ' with appropriate definitions of s\_ ' and yl[ s-1 l. Note 

*[s— 1] ~[s— 1] 

that S + differs from the tuple with elements Sj . Also, we can write 

a full cycle of the iteration step for solving (A.5) as S = <5+ + AS with 
appropriate definitions of <5 + and A. Finally, we can wr ite S [e] = S + + AS [e ~ 1] 
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with appropriate definitions of <5+ and A for a full cycle of 
df(z j )=W jj (z j )- 1 F j (f)(z j ) - W [W^C^)]-^^,^)^" 11 ^)^, 

i < i < p. 

For the convergence of the last iteration, we note that the projection opera- 
tors TTkj '■ — >• 7^j(W) for all 1 < j ^ k <p are Hilbert-Schmidt, where 
Hfc(W) is a subspace of 'H(W) such that 77 £ 'Hfc(W) if and only if 77^ = 
for all Z 7^ fc, and elements of ry fc with the configuration at (3.4) satisfy the 
constraints (3.5). This implies there exist constants Cq and < po < 1, with 

(A.7) y [S [e] (z)-S [oo] (z)] T W{z)[S [e] (z)-d lco] {z)]dz<C p 2 e 

for some limiting function S^. 

We now apply ||iL — A\\ = o p (l), \\6 + — 5 + \\ = o p (l) and sup s \\AW —A\\ < c, 

- [s] 

sup s \\S + — S+\\ < c with probability tending to one, for some constant c> 
that can be made as small as we like by choosing r small enough. This and 
equation (A.7) implies that for some constants C* and < p* < 1, 

/ [<5 M (z) " S [s '°°\z)] T W(z)[S [s '\z) - S^iz)] dz < C* P f, 

with probability tending to one. This completes the proof of Theorem 4. 
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